General instructions for all assignments:
Submit your R Markdown file (named as: [AndrewID]-Lab09.Rmd – e.g. “sventura-Lab09.Rmd”) to the Lab 09 submission section on Blackboard. You do not need to upload the .html file.
This week’s oral evaluation graphic: Star Wars Character Network
(See the link above for details on how this was created.)
Sam Says: It’s important to avoid inserting your own commentary when discussing graphics. Your job, as a statistician / data scientist / quantitative analyst, is to interpret the graph for the viewer. Discuss the facts presented via the data/graphic; avoid overstepping your boundaries by inserting your own opinions into what should be a presentation of facts.
Reminder: the following strategy is ideal when presenting graphs orally:
First, explain what is being shown in the graph. What is being plotted on each axis? What do the colors correspond to? What are the units (if applicable)? What are the ranges of different variables (if applicable)? Where does the data come from (if applicable)?
Next, explain the main takeaway of the graph. What do you want the viewer to understand after having seen this graph?
If applicable, explain any secondary takeaways or other interesting findings.
Finally, for this class, but not necessarily in general: Critique the graph. What do you like/dislike? What would you keep/change? Etc.
(5 points)
Use your theme. Do not directly copy the instructors’ theme.
For all graphs, do not use the default color scheme.
Set the options in the header of your .Rmd file so that all code is hidden. To do this, change code_folding: show to code_folding: hide at the top of your file.
(8 points each)
Networks, igraph, and ggnetwork
The igraph package makes network analysis easy! Install and load the igraph and igraphdata packages. Load the UKfaculty network into R:
#install.packages("igraph")
#install.packages("igraphdata")
library(igraph)
library(igraphdata)
data(UKfaculty)
Try out some of the built-in functions in the igraph package in order to summarize the UK faculty network. How many nodes (“vertices”) are in the network? (Use vcount().) How many edges (“links”) are in the network? (Use ecount().)
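As a quick illustration of these two functions, here is a minimal sketch on a small made-up network. The `toy` graph and its edges are hypothetical and are not part of the lab data; the same calls should work on any igraph object.

```r
library(igraph)

# Hypothetical toy network (not the lab data): 4 nodes, 5 directed edges.
# The edge vector is read in pairs: 1->2, 2->1, 1->3, 3->4, 4->1.
toy <- make_graph(c(1, 2,  2, 1,  1, 3,  3, 4,  4, 1), directed = TRUE)

n_nodes <- vcount(toy)  # number of nodes ("vertices")
n_edges <- ecount(toy)  # number of edges ("links")

n_nodes
n_edges
```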
The igraph package has some built-in functions for analyzing specific nodes/vertices in the network. Use the neighbors() function to find both the in-degree and out-degree of the 11th UK faculty member. How many friends does faculty #11 claim to have? How many other faculty members claim that #11 is their friend? (See Professor Rodu’s notes for how to use the neighbors() function.)
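A rough sketch of how neighbors() behaves on a directed graph, again using a hypothetical toy network rather than the lab data. The `mode` argument chooses between outgoing and incoming edges, and the number of neighbors returned is the corresponding degree.

```r
library(igraph)

# Hypothetical toy network: edges 1->2, 2->1, 1->3, 3->4, 4->1
toy <- make_graph(c(1, 2,  2, 1,  1, 3,  3, 4,  4, 1), directed = TRUE)

# Nodes that node 1 points to (outgoing edges)
out_nbrs <- neighbors(toy, 1, mode = "out")

# Nodes that point to node 1 (incoming edges)
in_nbrs <- neighbors(toy, 1, mode = "in")

length(out_nbrs)  # out-degree of node 1
length(in_nbrs)   # in-degree of node 1
```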
Write a function, called get_degree, that calculates the in-degree or out-degree (number of neighbors) that a given node in a network has. Your function should take three inputs: the node index/number, the network itself, and the degree type (in or out). The code is started for you below.
# node: The node index in the network
# network: The network to use in the degree calculations
# type: The degree type; must be either "in" or "out"
get_degree <- function(node, network, type) {
# Your code here
}
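Since the type input must be either "in" or "out", one option is to validate it at the top of your function. The sketch below shows one way to do that with base R's match.arg(); `check_type` is a hypothetical helper used only for illustration and is not part of the starter code.

```r
# check_type() is a hypothetical helper, not part of the lab's starter code.
# match.arg() returns the matched choice, or stops with an error otherwise.
check_type <- function(type = c("in", "out")) {
  match.arg(type)
}

ok <- check_type("out")

# An unsupported value triggers an error, caught here for demonstration
bad <- tryCatch(check_type("both"), error = function(e) "error")
```

Note that match.arg() also accepts unambiguous partial matches (for example, "o" for "out"), which may or may not be what you want.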
Write a function, called get_all_degrees, that calculates the in- or out-degree of every node in an entire network. Your function should call the function you wrote in part (c).
# network: The network to use in the degree calculations
# type: The degree type; must be either "in" or "out"
get_all_degrees <- function(network, type) {
# Your code here
}
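The general pattern here is to call a one-item function once per index and collect the results into a vector. A minimal sketch of that pattern, using a made-up stand-in function (`count_chars`) rather than anything network-related:

```r
# count_chars() is a hypothetical stand-in for any function that takes an
# index plus extra arguments and returns a single number.
count_chars <- function(i, items) {
  nchar(items[i])
}

items <- c("a", "bcd", "ef")

# Call count_chars() once per index; extra arguments are passed by name.
# integer(1) tells vapply() that each call must return one integer.
all_counts <- vapply(seq_along(items), count_chars, integer(1), items = items)

all_counts
```

For a network, the indices would run over the nodes instead, for example seq_len(vcount(network)).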
Apply your function to the UK faculty network dataset twice – once for each degree type (in or out). Create a new, three-column data frame that contains the results. One column should contain the in-degrees, one should contain the out-degrees, and a third column should contain the node index/number.
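One way to sanity-check your results is to compare them against igraph's built-in degree() function. The sketch below builds the three-column layout on a hypothetical toy network; the column names are illustrative choices, not requirements.

```r
library(igraph)

# Hypothetical toy network: edges 1->2, 2->1, 1->3, 3->4, 4->1
toy <- make_graph(c(1, 2,  2, 1,  1, 3,  3, 4,  4, 1), directed = TRUE)

# igraph::degree() is used here only as a reference to check against
toy_degrees <- data.frame(
  node       = seq_len(vcount(toy)),
  in_degree  = igraph::degree(toy, mode = "in"),
  out_degree = igraph::degree(toy, mode = "out")
)

toy_degrees
```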
Create a scatterplot of the in-degrees vs. the out-degrees, and change the point type to be their node index/number. Describe the graph. Which nodes are “overconfident” in their popularity (i.e. they claim to have many friends, but not as many others claim that they are friends with this person)? Which nodes are “underconfident” (i.e. many others claim that they are friends with this person, but this person does not claim to have many friends)?
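A minimal sketch of plotting labels in place of points, using geom_text() on a made-up degree table. The data values, column names, and the dashed reference line are assumptions for illustration only; remember to apply your own theme and colors.

```r
library(ggplot2)

# Hypothetical degree table; the values are made up for illustration
deg <- data.frame(
  node       = 1:5,
  in_degree  = c(2, 5, 1, 4, 3),
  out_degree = c(4, 5, 3, 1, 3)
)

p <- ggplot(deg, aes(x = out_degree, y = in_degree, label = node)) +
  geom_text() +                                   # draw node numbers, not points
  geom_abline(intercept = 0, slope = 1,
              linetype = "dashed") +              # line where in-degree = out-degree
  labs(x = "Out-degree", y = "In-degree",
       title = "In-degree vs. out-degree (toy data)")

p
```

Nodes below the dashed line have a larger out-degree than in-degree, and nodes above it have the reverse.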
Finally, let’s actually visualize the network itself! Install and load the ggnetwork package. This package is very new – it was just released on March 28th, 2016. You can read more about how to use it in this article on the package. Visualize the UK faculty network. The code below should get you started (be sure to add a title, remove the x- and y-axis labels and tick marks, and adjust the legend as necessary).
Note that you’ll also need to install and load the intergraph package. Finally, depending on which version of ggplot2, R, and RStudio you’re running (new versions of all three were just released), you may need to update your packages to the latest versions that are on GitHub (see the first commented line of code below):
#devtools::install_github("briatte/ggnetwork")
#devtools::install_github("mbojan/intergraph")
library(intergraph)
library(ggnetwork)
uk_data <- fortify(UKfaculty)
node_degrees <- igraph::degree(UKfaculty, mode = "in")
uk_degrees <- node_degrees[match(uk_data$vertex.names, 1:length(node_degrees))]
uk_data$degree <- uk_degrees
ggplot(uk_data,
       aes(x, y, xend = xend, yend = yend)) +
  geom_edges(arrow = arrow(length = unit(0.3, "lines")),
             aes(color = as.factor(Group)), alpha = 0.5) +
  geom_nodetext(aes(label = vertex.names, size = degree * 1.5),
                color = "blue", fontface = "bold")
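For the title and axis clean-up, here is a rough sketch on a hypothetical toy network. It assumes that ggnetwork's theme_blank() is available in your installed version; the colors and title are placeholders, and older ggnetwork versions may also need the intergraph package installed to convert igraph objects.

```r
library(ggplot2)
library(igraph)
library(ggnetwork)

# Hypothetical toy network: edges 1->2, 2->1, 1->3, 3->4, 4->1
toy <- make_graph(c(1, 2,  2, 1,  1, 3,  3, 4,  4, 1), directed = TRUE)

p <- ggplot(ggnetwork(toy), aes(x, y, xend = xend, yend = yend)) +
  geom_edges(color = "grey50") +
  geom_nodes(size = 4, color = "darkorange") +
  labs(title = "Toy network (illustrative only)") +
  theme_blank()  # ggnetwork theme that removes axes, labels, and tick marks

p
```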
Describe the network. How many groups (“cliques”) do there appear to be? In the initial graph from part (g), to what does the size of the nodes correspond? To what does the color of the edges correspond? Is the graph a directed or an undirected graph?
Further adjust the graph as you see fit in at least two more ways. See the article at the link above for ideas here. Describe what changes you made.
Identify Potential Datasets for Static Graphics Group Project
For the static graphics group project / poster presentation, you get to pick your own dataset!
Here are some repositories with many, many datasets to choose from:
You do not have to pick a dataset from one of these places. These are just suggestions.
Your data must contain a mix of categorical and continuous variables and be complex enough that you can create 8 interesting graphs (so datasets with only a few variables will not work).
You CANNOT use any of the datasets that were used in any previous assignments in this course or any other course you have taken. You must use a dataset that no one in your group has worked with before.
I’m strongly encouraging groups to pick different datasets, so that no group is using the same dataset. If you choose a dataset that another group has already chosen, I may ask you to switch.
Be sure to read the guidelines on the graphics below. These will certainly influence what datasets you choose.
In the group project itself, you will have some restrictions on the graphs you are allowed to use, which will certainly influence your choice of dataset. The restrictions are below:
(Note: We will be covering maps, time series, and text analysis in the next two weeks.)
There may be additional restrictions as well.
(28 points)
List at least four potential datasets that you can use for your group project. Include the name of the dataset, the source (e.g. UCI Machine Learning Repository), and a link.
You are not bound to using one of these datasets for your group project. Your group will finalize this choice next week.
The lab is due on Saturday at 6:30pm, not Friday. You should use this extra time to coordinate with your group members and find four potential datasets that you all agree on!